

************************************************************
************************************************************
***       Retrieving Panel Data from the SOEP            ***
***   3: Creating a Person-Year File (Long Format)       ***
***          Josef Brüderl, Volker Ludwig                ***
***                    May 2012                          *** 	   
************************************************************
************************************************************

* Data: SOEP 1984-2010 v27

* Basic settings and working directories
clear 
set more off
version 12

global pfad1 "C:\SOEP\V27\"        //directory of original data 
global pfad2 "C:\SOEP\work\"       //working directory



***************************************************************
*** 1. STEP: CREATING A MASTER-FILE FROM $PGEN          *******
***          CONTAINING ALL PERSON-YEARS IN THE SOEP    *******
***************************************************************

local year=1984                       //preparing $PGEN and saving these files
foreach wave in a b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	cd $pfad1
	use `wave'pgen.dta, clear
	ren persnr id                      //person identifier should be "id"
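	gen year=`year'                    //this will be our wave identifier
	local yy=substr("`year'",3,2)      //two-digit year suffix, e.g. "84" (assumes names like month84)
	ren month`yy' month                //"month84" is now "month" (and so on)
	ren lfs`yy' lfs                    //"lfs84" is now "lfs" (and so on)
	ren nation`yy' nation              //"nation84" is now "nation" (and so on)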
	ren `wave'famstd famstd            //"afamstd" is now "famstd" (and so on)
	keep id year hhnr hhnrakt month   ///these vars one should pull always
	             lfs famstd nation     //these vars are optional
	cd $pfad2
	save `wave'work.dta, replace      //save the prepared files
	local year=`year'+1
}
use awork.dta, clear                  //pool all years
foreach wave in b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	append using `wave'work.dta
}
save master.dta, replace           //save master file
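* Optional check (an addition to the original procedure): id and year should
* uniquely identify the person-years, because the later 1:1 merges rely on it.
* isid stops with an error if they do not.
isid id year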


***************************************************************
*** 2. STEP: PULL VARIABLES FROM PPFAD                  *******
***************************************************************

cd $pfad1
use ppfad.dta, clear             //load PPFAD
ren persnr id                    //our person identifier is "id"
keep id psample sex gebjahr     ///these vars one should pull always
        loc1989 migback          //these vars are optional
cd $pfad2
merge 1:m id using master.dta    //merge with (multiple) person-years from master.dta
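* Optional (an addition to the original procedure): inspect the match before
* dropping anything; _merge==1 are persons in PPFAD without any person-year.
tab _merge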
drop if _merge==1                //drop persons without person-year (mainly kids)
drop _merge                      //_merge has to be deleted before the next merge

save master.dta, replace              //replace master file


***************************************************************
*** 3. STEP: PULL VARIABLES FROM $PEQUIV                *******
***          IF NOT NEEDED, SIMPLY DELETE THIS SECTION  *******
***************************************************************
* Note: $PEQUIV is a convenient source for
*     - income variables (incl. consumer price index)
*     - household-level variables (incl. equivalence scale stuff)
*     - health information
*     - happiness
*     - weights (w111*)

local year=1984                        //preparing $PEQUIV and saving these files
foreach wave in a b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	cd $pfad1
	use `wave'pequiv.dta, clear
	ren persnr id                      //our person identifier is "id"
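	gen year=`year'                    //this is our wave identifier
	local yy=substr("`year'",3,2)      //two-digit year suffix, e.g. "84" (assumes names like i1110284)
	ren i11102`yy' hhinc1              //hh income
	ren l11101`yy' bula                //state of residence
	ren l11102`yy' east                //residence in the east
	ren d11104`yy' marstat             //marital status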
	keep id year                      ///these vars one must pull always
	 hhinc1 bula east marstat          //these vars are optional
	cd $pfad2
	save `wave'work.dta, replace       //save the prepared files
	local year=`year'+1
}
use awork.dta, clear                   //pool all years
foreach wave in b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	append using `wave'work.dta
}
merge 1:1 id year using master.dta   //merge with master.dta
drop if _merge==1                    //drop person-years only in $PEQUIV (mainly kids)
drop _merge                          //_merge has to be deleted before the next merge

save master.dta, replace              //replace master file


***************************************************************
*** 4. STEP: PULL VARIABLES FROM $P                     *******
***          IF NOT NEEDED, SIMPLY DELETE THIS SECTION  *******
***************************************************************

local year=1984                        //preparing $P and saving these files
foreach wave in a b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	cd $pfad1
	use `wave'p.dta, clear
	ren persnr id                      //our person identifier is "id"
	gen year=`year'                    //this is our wave identifier
	//the following varlist was obtained from SOEPinfo by copy and paste
	foreach var in ap6801 bp9301 cp9601 dp9801 ep89 fp108 gp109 hp10901 ip10901   ///
	    jp10901 kp10401 lp10401 mp11001 np11701 op12301 pp13501 qp14301           ///
		rp13501 sp13501 tp14201 up14501 vp154 wp142 xp149 yp15501 zp15701 bap160 {
		capture rename `var' happy     //rename to happy, do not stop if error
	}
	keep id year                      ///these vars one must pull always
	        happy                      //these vars are optional
	cd $pfad2
	save `wave'work.dta, replace       //save the prepared files
	local year=`year'+1
}

*****************************************************************
* Unfortunately, some person-years are not in $P but in special data sets.
* For happiness, this applies to GPOST and $PAGE17.
* If you do not want to pull from these data sets, delete this section.
cd $pfad1
use gpost.dta, clear     //GPOST (East 1990) needs special treatment
ren persnr id 
gen year=1990 
rename gp6401e happy     //rename to happy 	
keep id year happy 
cd $pfad2
save gework.dta, replace

local year=2006                        //preparing $PAGE17 and saving these files
foreach wave in w x y z ba{
	cd $pfad1
	use `wave'page17.dta, clear
	ren persnr id                      //our person identifier is "id"
	gen year=`year'                    //this is our wave identifier
 	foreach var in wj98 xj99 yj99 zj99 baj99 {
		capture rename `var' happy     //rename to happy, do not stop if error
	}
	keep id year happy
	cd $pfad2
	save `wave'ywork.dta, replace      //save the prepared files
	local year=`year'+1
}
*******************************************************************

use awork.dta, clear                //pool all years
foreach wave in b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	append using `wave'work.dta
}
capture append using gework.dta            //add East 1990
capture append using wywork.dta            //add youth 2006
capture append using xywork.dta            //add youth 2007
capture append using yywork.dta            //add youth 2008
capture append using zywork.dta            //add youth 2009
capture append using baywork.dta           //add youth 2010


merge 1:1 id year using master.dta //merge with master.dta
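* Optional (an addition to the original procedure): unlike in step 3, person-years
* found only in the $P files (_merge==1) are kept here; check how many there are.
tab year _merge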
drop _merge                        //_merge has to be deleted before the next merge

sort hhnrakt year              //not required: merge sorts the data itself if needed
save master.dta, replace       //replace master file


***************************************************************
*** 5. STEP: PULL VARIABLES FROM $HGEN                  *******
***          IF NOT NEEDED, SIMPLY DELETE THIS SECTION  *******
***************************************************************
* Information on households

local year=1984                        //preparing $HGEN and saving these files
foreach wave in a b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	cd $pfad1
	use `wave'hgen.dta, clear
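	gen year=`year'                    //this will be our wave identifier
	local yy=substr("`year'",3,2)      //two-digit year suffix, e.g. "84" (assumes names like ahinc84)
	ren ahinc`yy'  hhinc2              //"ahinc84" is now "hhinc2" (and so on)
	ren i1hinc`yy' hhinc3              //"i1hinc84" is now "hhinc3" (and so on)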
	keep hhnrakt year                 ///these vars one must pull always
	        hhinc2 hhinc3              //these vars are optional
	cd $pfad2
	save `wave'work.dta, replace       //save the prepared files
	local year=`year'+1
}
use awork.dta, clear                   //pool all years
foreach wave in b c d e f g h i j k l m n o p q r s t u v w x y z ba{
	append using `wave'work.dta
}
merge 1:m hhnrakt year using master.dta  //merge with master.dta
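* Optional diagnostic (an addition to the original procedure): household-years without
* any person-year (_merge==1) would stay in the data with a missing id;
* person-years without a household-year (_merge==2) are discussed below.
count if _merge==1
tab year if _merge==2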
drop _merge                          //_merge has to be deleted before the next merge
* Note: there are person-years with _merge==2 (N=140), i.e., with a hhnrakt for which
* no corresponding household-year exists in $HGEN. For these cases hhinc2/3 are ".",
* whereas hhinc1 (from $PEQUIV) has a valid value (often 0).
* The reason is unclear; something seems to be wrong here.

save master.dta, replace              //replace master file


***************************************************************
***    STEP: PULL WEIGHTS FROM PHRF                     *******
***          IF NOT NEEDED, SIMPLY DELETE THIS SECTION  *******
***************************************************************
* Person weights 
* It is suggested that you pull weights from $PEQUIV if needed
* (variables w111*)


***************************************************************
***    STEP: RESTRICTING THE SAMPLE A LA SOEPinfo       *******
***          IF NOT NEEDED, SIMPLY DELETE THIS SECTION  *******
***************************************************************
* SOEPinfo allows you to
*  a) retrieve balanced samples                  (rarely used)
*  b) retrieve only person-years in private hh   (not recommended)
*  c) use only certain subsamples                (in special situations)

/*
* a) Retrieve a balanced sample
cd $pfad1
use ppfad.dta, clear                  //load PPFAD
ren persnr id                         //our person identifier is "id"
keep if ((anetto>0&anetto<40) &       ///balanced sample waves a and b
         (bnetto>0&bnetto<40))
keep id 
cd $pfad2
merge 1:m id using master.dta    //merge with (multiple) person-years from master.dta
drop if _merge<3                 //drop persons not in the balanced sample (waves a and b)
drop _merge                      //_merge has to be deleted before the next merge
keep if year<=1985               //keep only relevant person-years
save master.dta, replace         //replace master file
*/

/*
* b) Delete person-years not in private household
cd $pfad1
use ppfad.dta, clear                  //load PPFAD
ren persnr id                         //our person identifier is "id"
local year=1984                      
foreach wave in a b c d e f g h i j k l m n o p q r s t u v w x y z ba{
		rename `wave'pop  pop`year'   //renaming to "pop*"
		local year=`year'+1
}
keep id pop*
reshape long pop, i(id) j(year)       //reshape to long format
cd $pfad2
merge 1:1 id year using master.dta  //merge with person-years from master.dta
drop if _merge==1                   //drop person-years without interview
drop _merge                         //_merge has to be deleted before the next merge
keep if pop<=2                      //keep only person-years in private households
save master.dta, replace            //replace master file
*/

/*
* c) Delete persons from specific subsamples
drop if psample==7   //high-income sample
*/


***************************************************************
*** 6. STEP: SAVE FINAL DATA SET                        *******
***************************************************************
compress
sort  id year
order id year hhnr hhnrakt month psample gebjahr sex
save ANALYSISDATASET.dta, replace              //CHOOSE YOUR NAME!!!!

* Delete auxiliary files
foreach wave in a b c d e f g ge h i j k l m n o p q r s t u v w wy x xy y yy z zy ba bay{
	capture erase `wave'work.dta
}
erase master.dta


***************************************************************
*** 7. STEP: SOME BASIC INFORMATION ON THE DATA SET     *******
***************************************************************

* Tabulating central variables (and getting N*T and N)
bysort id (year): gen pynr = _n //person-years numbered consecutively (within person)
tab happy                       //person-years
tab happy  if pynr==1           //persons
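* A sketch (not part of the original file): number of person-years per person.
* "npy" is a helper name introduced here.
bysort id: gen npy = _N         //person-years observed for this person
tab npy if pynr==1              //distribution over persons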


* Separating EAST/WEST
tab east      //time-varying east-west indicator
tab loc1989   //time-constant location 1989


* Separating GERMANS/FOREIGNERS
tab nation   //time-varying nationality
tab migback  //time-constant migration background


* Always check whether variables from $P files are coded consistently over time!
*    bysort year: tab happy
* Note that value labels are taken from the first matching data set (here ap.dta).
* Thus, changes in coding are not easy to see with this procedure. It is therefore
* better to check longitudinal consistency within SOEPinfo (put all vars into the
* basket and get frequencies).
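* A complementary check (a sketch, not part of the original file): the range of the
* raw codes by year shows at a glance whether the scale changed. Negative values
* are missing codes (cf. the mvdecode command further down).
tabstat happy, by(year) statistics(min max n)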


* Should I use vars from $PGEN or $PEQUIV?
* Many variables are in both data sets. For instance, marital status is measured by
* "famstd" (PGEN) and "marstat" (PEQUIV). According to the Data Documentation, marstat
* is a recode of famstd. It is often the case that PEQUIV variables are derived from
* PGEN variables. Then it is recommended that you use the PGEN version. (PEQUIV vars
* often have an "American" flavor).
tab1 famstd marstat   //there are unexplained differences --> use famstd


* Should I use hhinc from $PEQUIV or $HGEN?
* hhinc1 from PEQUIV (i11102$$) (includes some imputation)
* hhinc2 from HGEN   (ahinc$$)  (from hhquestionnaire but corrected)
* hhinc3 from HGEN   (i1hinc$$) (imputed since 1995 only)
tab hhinc1 if (hhinc1<10 | hhinc1>1000000) & year>=1995, m
tab hhinc2 if (hhinc2<10 | hhinc2>100000)  & year>=1995, m
tab hhinc3 if (hhinc3<10 | hhinc3>100000)  & year>=1995, m
* hhinc1 uses imputation. Therefore, use it if you want to minimize missing values.
* If you do not trust the imputation, use hhinc2 (the "corrected" original).

* Does imputation of hhinc work well?
sort hhnrakt year
mvdecode hhinc?, mv(-1=.a \ -2=.b \ -3=.c)
list hhnrakt year hhinc2 hhinc3 if hhnrakt==27&id==201  //I am not sure!!??
* However, hhinc3 correlates better with hhinc2
corr hhinc? if year>=1995


